Part I - Google PlayStore App Data¶

by Moyinoluwa Busola Agbaje¶

Introduction¶

In this project, I will be investigating some of the data gotten from the Google PlayStore analytics. This data is provided by companies like Sensor Tower or App Annie, which helps to drive development and App marketing strategies to succeed for many companies. It's really a BIG THINGY!

Data source:¶

The App and review data was scrapped from the Google PlayStore by Lavanya Gupta in 2018. The original files are listed here.

Preliminary Wrangling¶

In [1]:
# import all packages and libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sb
import plotly.express as px

%matplotlib inline
In [2]:
# Show numeric output in decimal format (e.g. 2.15)

pd.options.display.float_format = '{:,.2f}'.format
In [3]:
# load data

df_gapps = pd.read_csv("googleapps.csv")
df_gapps.head()
Out[3]:
App Category Rating Reviews Size_MBs Installs Type Price Content_Rating Genres Last_Updated Android_Ver
0 Ak Parti Yardım Toplama SOCIAL NaN 0 8.70 0 Paid $13.99 Teen Social July 28, 2017 4.1 and up
1 Ain Arabic Kids Alif Ba ta FAMILY NaN 0 33.00 0 Paid $2.99 Everyone Education April 15, 2016 3.0 and up
2 Popsicle Launcher for Android P 9.0 launcher PERSONALIZATION NaN 0 5.50 0 Paid $1.49 Everyone Personalization July 11, 2018 4.2 and up
3 Command & Conquer: Rivals FAMILY NaN 0 19.00 0 NaN 0 Everyone 10+ Strategy June 28, 2018 Varies with device
4 CX Network BUSINESS NaN 0 10.00 0 Free 0 Everyone Business August 6, 2018 4.1 and up
In [4]:
# structure and information of  the dataset

print(df_gapps.shape)
print(df_gapps.columns)
print(df_gapps.info())
(10841, 12)
Index(['App', 'Category', 'Rating', 'Reviews', 'Size_MBs', 'Installs', 'Type',
       'Price', 'Content_Rating', 'Genres', 'Last_Updated', 'Android_Ver'],
      dtype='object')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10841 entries, 0 to 10840
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   App             10841 non-null  object 
 1   Category        10841 non-null  object 
 2   Rating          9367 non-null   float64
 3   Reviews         10841 non-null  int64  
 4   Size_MBs        10841 non-null  float64
 5   Installs        10841 non-null  object 
 6   Type            10840 non-null  object 
 7   Price           10841 non-null  object 
 8   Content_Rating  10841 non-null  object 
 9   Genres          10841 non-null  object 
 10  Last_Updated    10841 non-null  object 
 11  Android_Ver     10839 non-null  object 
dtypes: float64(2), int64(1), object(9)
memory usage: 1016.5+ KB
None

Data Cleaning¶

  • Remove unused columns, NaN values and Duplicates
  • Remove unwanted symbols and convert data into numeric format.
  • Wrangle columns containing nested data with Pandas.
In [5]:
# remove unused columns "Android_Ver" and "Last_Updated"

df_gapps = df_gapps.drop(columns=["Last_Updated", "Android_Ver"], axis=1)

df_gapps.head()
Out[5]:
App Category Rating Reviews Size_MBs Installs Type Price Content_Rating Genres
0 Ak Parti Yardım Toplama SOCIAL NaN 0 8.70 0 Paid $13.99 Teen Social
1 Ain Arabic Kids Alif Ba ta FAMILY NaN 0 33.00 0 Paid $2.99 Everyone Education
2 Popsicle Launcher for Android P 9.0 launcher PERSONALIZATION NaN 0 5.50 0 Paid $1.49 Everyone Personalization
3 Command & Conquer: Rivals FAMILY NaN 0 19.00 0 NaN 0 Everyone 10+ Strategy
4 CX Network BUSINESS NaN 0 10.00 0 Free 0 Everyone Business
In [6]:
# check for NaN values especially in 'Rating' column

df_gapps['Rating'].isna().sum()
Out[6]:
1474

Insight¶

  • There are 1474 NaN values in the Rating column, so these values need to be dropped.
  • In [7]:
    # drop the NaN values
    
    df_gapps_clean = df_gapps.dropna()
    df_gapps_clean.shape
    
    Out[7]:
    (9367, 10)
    In [8]:
    # check for duplicate records
    
    df_gapps_clean.duplicated().values.any()
    
    Out[8]:
    True
    In [9]:
    # check for duplicates especially the 'Instagram' App
    
    df_gapps_clean[df_gapps_clean['App'] == "Instagram"]
    
    Out[9]:
    App Category Rating Reviews Size_MBs Installs Type Price Content_Rating Genres
    10806 Instagram SOCIAL 4.50 66577313 5.30 1,000,000,000 Free 0 Teen Social
    10808 Instagram SOCIAL 4.50 66577446 5.30 1,000,000,000 Free 0 Teen Social
    10809 Instagram SOCIAL 4.50 66577313 5.30 1,000,000,000 Free 0 Teen Social
    10810 Instagram SOCIAL 4.50 66509917 5.30 1,000,000,000 Free 0 Teen Social

    Insight¶

  • There are four (4) duplicate entries on Instagram App.
  • In [10]:
    # to completely remove duplicates, I need to explicitly specified the subsets
    
    df_gapps_clean = df_gapps_clean.drop_duplicates(subset=["App", 'Type', 'Price'])
    
    # check the Instagram App
    
    df_gapps_clean[df_gapps_clean['App'] == 'Instagram']
    
    Out[10]:
    App Category Rating Reviews Size_MBs Installs Type Price Content_Rating Genres
    10806 Instagram SOCIAL 4.50 66577313 5.30 1,000,000,000 Free 0 Teen Social

    Insight¶

  • There is only one Instagram entry in the dataset.
  • In [11]:
    # type conversion
    
    # remove "," form Installs values
    
    df_gapps_clean.Installs = df_gapps_clean["Installs"].astype(str).str.replace(",", "")
    
    In [12]:
    # convert to numeric
    
    df_gapps_clean.Installs = pd.to_numeric(df_gapps_clean["Installs"])
    

    Find the most expensive Apps, filter out the Junk, and calculate a (ballpark) Sales Revenue Estimate¶

  • Taking a deeper dive into the 'Price' column.

    </li>

  • In [13]:
    # remove the "$" sign
    
    df_gapps_clean['Price'] = df_gapps_clean['Price'].astype(str).str.replace("$", "")
    
    # convert to numeric
    
    df_gapps_clean['Price'] = pd.to_numeric(df_gapps_clean['Price'])
    
    C:\Users\Moyinoluwa Busola\AppData\Local\Temp\ipykernel_13348\1423137204.py:3: FutureWarning: The default value of regex will change from True to False in a future version. In addition, single character regular expressions will *not* be treated as literal strings when regex=True.
      df_gapps_clean['Price'] = df_gapps_clean['Price'].astype(str).str.replace("$", "")
    
    In [14]:
    # check for the most expensive Apps
    
    df_gapps_clean[df_gapps_clean["Price"]> 250]
    
    Out[14]:
    App Category Rating Reviews Size_MBs Installs Type Price Content_Rating Genres
    1331 most expensive app (H) FAMILY 4.30 6 1.50 100 Paid 399.99 Everyone Entertainment
    1946 I am rich (Most expensive app) FINANCE 4.10 129 2.70 1000 Paid 399.99 Teen Finance
    2193 I am extremely Rich LIFESTYLE 2.90 41 2.90 1000 Paid 379.99 Everyone Lifestyle
    2394 I am Rich! FINANCE 3.80 93 22.00 1000 Paid 399.99 Everyone Finance
    2461 I AM RICH PRO PLUS FINANCE 4.00 36 41.00 1000 Paid 399.99 Everyone Finance
    2775 I Am Rich Pro FAMILY 4.40 201 2.70 5000 Paid 399.99 Everyone Entertainment
    3114 I am Rich FINANCE 4.30 180 3.80 5000 Paid 399.99 Everyone Finance
    3145 I am rich(premium) FINANCE 3.50 472 0.94 5000 Paid 399.99 Everyone Finance
    3221 I am Rich Plus FAMILY 4.00 856 8.70 10000 Paid 399.99 Everyone Entertainment
    3554 💎 I'm rich LIFESTYLE 3.80 718 26.00 10000 Paid 399.99 Everyone Lifestyle
    3856 I am rich VIP LIFESTYLE 3.80 411 2.60 10000 Paid 299.99 Everyone Lifestyle
    3897 I Am Rich FAMILY 3.60 217 4.90 10000 Paid 389.99 Everyone Entertainment
    3946 I'm Rich - Trump Edition LIFESTYLE 3.60 275 7.30 10000 Paid 400.00 Everyone Lifestyle
    4606 I Am Rich Premium FINANCE 4.10 1867 4.70 50000 Paid 399.99 Everyone Finance
    5765 I am rich LIFESTYLE 3.80 3547 1.80 100000 Paid 399.99 Everyone Lifestyle
    In [15]:
    # remove Apps which price are greater than 250 
    
    df_gapps_clean = df_gapps_clean[df_gapps_clean['Price'] < 250 ]
    df_gapps_clean.sort_values('Price', ascending=False).head()
    
    Out[15]:
    App Category Rating Reviews Size_MBs Installs Type Price Content_Rating Genres
    2281 Vargo Anesthesia Mega App MEDICAL 4.60 92 32.00 1000 Paid 79.99 Everyone Medical
    1407 LTC AS Legal MEDICAL 4.00 6 1.30 100 Paid 39.99 Everyone Medical
    2629 I am Rich Person LIFESTYLE 4.20 134 1.80 1000 Paid 37.99 Everyone Lifestyle
    2481 A Manual of Acupuncture MEDICAL 3.50 214 68.00 1000 Paid 33.99 Everyone Medical
    2463 PTA Content Master MEDICAL 4.20 64 41.00 1000 Paid 29.99 Everyone Medical
    In [16]:
    # create a column called 'Revenue Estimate'
    
    df_gapps_clean["Revenue_Estimate"] = df_gapps_clean['Price'].mul(df_gapps_clean['Installs'])
    df_gapps_clean.sort_values("Revenue_Estimate", ascending=False).head(10)
    
    Out[16]:
    App Category Rating Reviews Size_MBs Installs Type Price Content_Rating Genres Revenue_Estimate
    9220 Minecraft FAMILY 4.50 2376564 19.00 10000000 Paid 6.99 Everyone 10+ Arcade;Action & Adventure 69,900,000.00
    8825 Hitman Sniper GAME 4.60 408292 29.00 10000000 Paid 0.99 Mature 17+ Action 9,900,000.00
    7151 Grand Theft Auto: San Andreas GAME 4.40 348962 26.00 1000000 Paid 6.99 Mature 17+ Action 6,990,000.00
    7477 Facetune - For Free PHOTOGRAPHY 4.40 49553 48.00 1000000 Paid 5.99 Everyone Photography 5,990,000.00
    7977 Sleep as Android Unlock LIFESTYLE 4.50 23966 0.85 1000000 Paid 5.99 Everyone Lifestyle 5,990,000.00
    6594 DraStic DS Emulator GAME 4.60 87766 12.00 1000000 Paid 4.99 Everyone Action 4,990,000.00
    6082 Weather Live WEATHER 4.50 76593 4.75 500000 Paid 5.99 Everyone Weather 2,995,000.00
    7954 Bloons TD 5 FAMILY 4.60 190086 94.00 1000000 Paid 2.99 Everyone Strategy 2,990,000.00
    7633 Five Nights at Freddy's GAME 4.60 100805 50.00 1000000 Paid 2.99 Teen Action 2,990,000.00
    6746 Card Wars - Adventure Time FAMILY 4.30 129603 23.00 1000000 Paid 2.99 Everyone 10+ Card;Action & Adventure 2,990,000.00

    Extracting Nested Data From a column¶

    In [17]:
    # Issue: having multiple Genres separated by ;
    
    df_gapps_clean.Genres.value_counts().sort_values(ascending=True)[:5]
    
    Out[17]:
    Lifestyle;Pretend Play      1
    Strategy;Education          1
    Adventure;Education         1
    Role Playing;Brain Games    1
    Tools;Education             1
    Name: Genres, dtype: int64
    In [18]:
    # split the string on ; and then .stack them
    
    stack = df_gapps_clean.Genres.str.split(";", expand=True).stack()
    
    print(f"We now have a single column shape: {stack.shape}")
    
    We now have a single column shape: (8564,)
    
    In [19]:
    # number of Genres
    
    numb_genres = stack.value_counts()
    print(f"We have: {len(numb_genres)} genres.")
    
    We have: 53 genres.
    
    In [20]:
    print(df_gapps_clean.shape)
    df_gapps_clean.head()
    
    (8184, 11)
    
    Out[20]:
    App Category Rating Reviews Size_MBs Installs Type Price Content_Rating Genres Revenue_Estimate
    21 KBA-EZ Health Guide MEDICAL 5.00 4 25.00 1 Free 0.00 Everyone Medical 0.00
    28 Ra Ga Ba GAME 5.00 2 20.00 1 Paid 1.49 Everyone Arcade 1.49
    47 Mu.F.O. GAME 5.00 2 16.00 1 Paid 0.99 Everyone Arcade 0.99
    82 Brick Breaker BR GAME 5.00 7 19.00 5 Free 0.00 Everyone Arcade 0.00
    99 Anatomy & Physiology Vocabulary Exam Review App MEDICAL 5.00 1 4.60 5 Free 0.00 Everyone Medical 0.00

    What is the structure of your dataset?¶

    The cleaned dataset now have 8184 rows, 11 columns.

    What is/are the main feature(s) of interest in your dataset?¶

    1. How many downloads would a developer give up by making an app paid or free?
    2. Which app categories offer compelling opportunities based on its popularity?
    3. Which Genres are most competitive (Entertainment, Tools, Education, Action, Sports)?
    4. Which content is very competitive (Everyone, Teen, Mature 17+)?
    5. How competitive different categories are(Games, Social, Weather)?
    6. Which paid apps caegory have the highest revenue?

    What features in the dataset do you think will help support your investigation into your feature(s) of interest?¶

    All the features in the dataset will helpful especially those of Price, Type, App, Rating,Category, Installs and Revenue Estimates columns.

    Univariate Exploration¶

    Question 1¶

    Which content have the highest ratings?

    Visualization 1¶

    In [22]:
    content_ratings = df_gapps_clean['Content_Rating'].value_counts()
    
    # plot a pie chart
    plt.figure(figsize=(10, 8))
    figr = px.pie(labels=content_ratings.index,
                 values=content_ratings.values,
                 title="Content Rating",
                names=content_ratings.index);
    
    C:\Users\Moyinoluwa Busola\anaconda3\lib\site-packages\plotly\express\_core.py:137: FutureWarning: Support for multi-dimensional indexing (e.g. `obj[:, None]`) is deprecated and will be removed in a future version.  Convert to a numpy array before indexing instead.
      return args["labels"][column]
    
    <Figure size 720x576 with 0 Axes>
    In [23]:
    # update traces
    figr.update_traces(textposition="outside", textinfo='percent + label');
    
    figr.show()
    

    Observation 1¶

    • The Content Rating for 'Everyone' has 80.7%, while there is a very few 'Unrated' apps at 0.0122%.</p>

    Question 2¶

    What are the top ten (10) category apps?

    Visualization 2¶

    In [24]:
    top_10_categories = df_gapps_clean['Category'].value_counts()
    
    # plot bar chart
    plt.figure(figsize=(12, 10))
    plt.xticks(rotation=30)
    plt.xlabel("Top 10 Categories")
    plt.ylabel("Count")
    plt.title("Top 10 Categories for Apps")
    fig = plt.bar(x=top_10_categories.index[:10], height=top_10_categories.values[:10]);
    
    plt.show()
    

    Observation 2¶

    • Business Category has the least count, while the Family Category has the highest count.

    Question 3¶

    What are the top ten (10) genres?

    Visualization 3¶

    In [26]:
    top_10_genres = df_gapps_clean['Genres'].value_counts()
    
    # plot horizontal bar chart
    
    plt.figure(figsize=(10, 8))
    plt.yticks(rotation=30)
    plt.xlabel("Count")
    plt.ylabel("Genres")
    plt.title("Top 10 Genres")
    
    plt.barh(top_10_genres[:10].index, top_10_genres[:10].values);
    
    plt.show()
    

    Observation 3¶

    • 'Tools' genre tops the list, while 'Sports' genre came the least.

    Question 4¶

    What is the distribution of Size in MBs?

    Visualization 4¶

    In [28]:
    # plot histogram
    
    bins = np.arange(0, df_gapps_clean['Size_MBs'].max()+5, 5)
    plt.figure(figsize=(8, 6))
    plt.xlabel("Size MBs")
    plt.ylabel("Count")
    plt.hist(df_gapps_clean["Size_MBs"], bins=bins);
    
    plt.show()
    

    Observation 4¶

    • The distribution is a normal log distribution.

    Question 5¶

    Which are the most downloaded apps? Are they free or paid?

    Visualization 5¶

    In [29]:
    paid_vs_free_apps = df_gapps_clean['Type'].value_counts()
    
    # plotting a donut chart
    plt.figure(figsize=(10, 8))
    fig = px.pie(labels=paid_vs_free_apps.index,
                 values=paid_vs_free_apps.values,
                 hole=0.6,
                 title="Paid Vs Free Apps",
                names=paid_vs_free_apps.index);
    
    # update traces
    fig.update_traces(textposition="outside", textinfo="percent + label");
    
    fig.show()
    
    C:\Users\Moyinoluwa Busola\anaconda3\lib\site-packages\plotly\express\_core.py:137: FutureWarning:
    
    Support for multi-dimensional indexing (e.g. `obj[:, None]`) is deprecated and will be removed in a future version.  Convert to a numpy array before indexing instead.
    
    
    <Figure size 720x576 with 0 Axes>

    Observation 5¶

    • Most of the apps are 'Free' with 92.8% compared to paid apps with 7.2%.

    Question 6¶

    How are Ratings distributed?

    Visualization 6¶

    In [35]:
    # plot histogram
    
    sb.displot(df_gapps_clean['Rating'], height=4.5);
    

    Observation 6¶

    • The Rating's distribution data is skewed to the right.

    Question 7¶

    How is the 'Size_MBs' distributed?

    Visualization 7¶

    In [37]:
    # plot histogram
    
    sb.displot(df_gapps_clean["Size_MBs"], kind="hist", height=6);
    

    Observation 7¶

    • The Size_MBs' data is skewed to the right.

    Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?¶

    Most variables are either left skewed or right skewed. There were unusual points in the data which I detected. I performed different transformations on the data in other to get good visualizations.

    Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?¶

    Yes, I did some operations to adjust or change the form of data in order to capture the main features of focus. The processes helped me to make better observations by using the correct data format. There have been an unusual distributions on the data.

    Bivariate Exploration¶

    Question 1¶

    How does the correlation between 'Installs' and number of Apps (Categorie) look like?

    In [38]:
    # here, I will have to create a dataframe to show this relationship, then groupby 'Category' and sum the number of 'Installs'
    
    category_by_installs = df_gapps_clean.groupby('Category').agg({'Installs': pd.Series.sum})
    
    In [39]:
    # groupby Category and count the number of Apps
    
    category_by_num = df_gapps_clean.groupby('Category').agg({'App': pd.Series.count})
    
    In [40]:
    # new dataframe
    
    df_appcombined = pd.merge(category_by_num, category_by_installs, on='Category', how='inner')
    df_appcombined.head()
    
    Out[40]:
    App Installs
    Category
    ART_AND_DESIGN 61 114233100
    AUTO_AND_VEHICLES 73 53129800
    BEAUTY 42 26916200
    BOOKS_AND_REFERENCE 169 1665791655
    BUSINESS 262 692018120

    Visualization 1¶

    In [42]:
    # plot a scatter plot
    
    fig = px.scatter(df_appcombined, # data 
                    x='App', # names
                    y='Installs', 
                    size="App",
                    hover_name=df_appcombined.index,
                    color="Installs", 
                    title="Category Correlation")
    
    fig.update_layout(xaxis_title="Number Of Apps",
                     yaxis_title="Installs",
                     yaxis=dict(type="log"))
    fig.show()
    

    Observation 1¶

    • Most categories correlated more on the left side, compared to a few categories on the right side.

    Question 2¶

    How many Installs are paid compared to the free installs?

    Visualization 2¶

    In [44]:
    # plot box plot
    
    fig = px.box(df_gapps_clean, 
                x="Type", 
                y= "Installs", 
                color = "Type", 
                notched=True, 
                title="How many Installs are paid compared to free?", 
                )
    fig.update_layout(xaxis_title="Type of Install",
                     yaxis_title="Installs", 
                     yaxis=dict(type=("log")))
    fig.show()
    

    Observation 2¶

    • Free Installs are 100 times more compared to Paid Installs.

    Question 3¶

    How much can 'Paid' apps earn?

    Visualization 3¶

    In [45]:
    # plot box plot
    
    paid_gapps = df_gapps_clean[df_gapps_clean["Type"] == "Paid"]
        
    box = px.box(paid_gapps, 
                x="Category", 
                y= "Revenue_Estimate",  
                title="How much can Paid Apps earn per Category?"
                )
    box.update_layout(xaxis_title="Category", 
                     yaxis_title="Paid App BallPark Revenue", 
                      xaxis= {'categoryorder': 'min ascending'},
                      yaxis= dict(type=("log")))
    box.show()
    

    Observation 3¶

    • Family category came top of the chart with a maximum revenue of 70 million dollars.
    • Further to this, different categories made reasonable revenue estimates, hence making App development a luxurious business. For example, the 'Game' category has a revenue estimate of 10 million dollars.

    Question 4¶

    Which type of install have majority count in 'Content_Rating'?

    Visualization 4¶

    In [49]:
    # plot bar chart
    
    def plot():
        plt.figure(figsize=(5,5))
        plt.xticks(rotation=45);
        plt.xlabel("Content Rating")
        plt.ylabel("Count")
        sb.countplot(data=df_gapps_clean, x="Content_Rating", hue="Type");
        
    plot()    
    

    Observation 4¶

    • 'Everyone' has the most/top content with majority of apps been free.

    Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?¶

    A very positive relationship occured within the features. Take for instance, the relationship between Type and Installs. Most Installs are free compared to paid, and also the relationship between Category and Revenue Estimate, especially on paid apps was very varying.

    Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?¶

    Yes, of course. For example, I thought paid apps would not have a high revenue, but I found out in the family category that, paid installs made a whooping revenue lot of $69.9 Milion.

    Multivariate Exploration¶

    Question 1¶

    What is the relationship between App, Category and Type?

    Visualization 1¶

    In [50]:
    # free vs paid apps
    
    free_vs_paid = df_gapps_clean.groupby(["Category", "Type"], as_index=False).agg({"App": pd.Series.count})
    
    # plot a clustured bar chart
    
    plt.figure(figsize=(16, 10))
    plt.xlabel("Category")
    plt.ylabel("App")
    plt.title("Relationship between App, Category and Type.")
    plt.xticks(rotation=90)
    sb.barplot(data=free_vs_paid, x="Category", y="App", hue="Type");
    
    plt.show()
    

    Observation 1¶

    For most the categorieFrom the above relationship, most of the apps are Free compared to Paid apps.

    Question 2¶

    What is the relationship between Installs, Rating and Price?

    Visualization 2¶

    In [54]:
    # plot
    
    plt.figure(figsize=(10, 7))
    plt.xlabel("Rating")
    plt.ylabel("Price")
    plt.scatter(data=df_gapps_clean, x="Rating", y="Price", c="Installs");
    plt.colorbar(label="Installs");
    

    Observation 2¶

    • Most of the installed apps are free i.e ($0), and have a high rating. These are the popular social media apps, which include Instagram, Facebook, WhatsApp, amongst others.

    Question 3¶

    What is the relationship between Installs, Content_Rating and Rating?

    Visualization 3¶

    In [58]:
    # plot
    
    fig = sb.FacetGrid(data=df_gapps_clean, col="Content_Rating");
    fig.map(plt.scatter, "Rating", "Installs", alpha=1);
    

    Observation 3¶

    • Looking at the different contents, 'Everyone' content has a high rating and high installs, which is closely followed by 'Teen' content. Then 'Mature 17+' content with relatively low installs which is similar to the that of 'Everyone 10+' content.

    Question 4¶

    What could a relationship between 6 numeric varibles look like?

    Visualization 4¶

    In [63]:
    # plot the relationship between six (6) numeric variables
    
    apps_stats = ["Rating", "Reviews", "Size_MBs", "Installs", "Price", "Revenue_Estimate"]
    
    def plot_matron():
    
        fig = sb.PairGrid(data=df_gapps_clean, vars=apps_stats)
        fig.map_diag(plt.hist);
        fig.map_offdiag(plt.scatter);
    
    plot_matron()
    

    Observation 4¶

    • It is quite hard to make a concrete conclusion based on the plots, hence just a speculation can be made.

    Question 5¶

    How can the relationship between the 6 numeric variables above be clearly shown?

    Visualization 5¶

    In [64]:
    # plot the relationship between the 6 numeric variables above
    
    plt.figure(figsize=(12, 10))
    plt.title("A Relationship Between Six (6) Variables")
    sb.heatmap(df_gapps_clean[apps_stats].corr(), cmap="viridis_r", vmin=0, annot=True);
    plt.show()
    

    Observation 5¶

    • A positive relationship exists between them.

    Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?¶

    Most of the relationships were positive. Yes, there were features that strengthened each other. An instance is when three or more variables are combined, they strengthen the visuals and the observations, hence making better conclusions.

    Were there any interesting or surprising interactions between features?¶

    Yes, there were interesting interactions between features. For example, when visualizing the relationship between Installs, Rating and Price, most Installs and Ratings were on free ($0) apps.

    Conclusions¶

    Summary¶

    • Most of the app Installs from the dataset are free apps compared to paid apps. Although paid apps accumulated a reasonable amount of revenue, paid installs turned out to be largely lower compared to free apps installs.
    • On the part of content, 'Everyone' content had a large number of installs compared to other contents.
    • Finally, most apps with high ratings and reviews are free, therefore making app development a very exiting field to analyse.

    Steps taken during Data Exploration¶

    • Assesing
    • Cleaning
    • Exploring
    • Analysing
    • Sharing

    After loading my dataset, I thoroughly assessed the data and figured out the main features that I will need to explore and analyse, so as to arrive at better conclusions. All the steps I took are well documented.

    In [ ]:
     
    
    In [ ]: